Our project used data from Kaggle’s 2013 Yelp Challenge, which included a subset of Yelp data from the metropolitan area of Phoenix, Arizona. The data include user reviews, ratings, and check-in data for a wide range of businesses.
Data was acquired and transformed in the preprocessing.R file located within our repository’s final-project folder. Our data source was provided as multi-line JSON files, meaning each file is a collection of JSON records. We used the stream_in function from the jsonlite package, which parses JSON data line by line, to read the files from the data folder of our repository. The collections included three large datasets covering Yelp businesses, users, and reviews.
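The ingestion step can be sketched as follows; the file names under data/ are assumptions based on the standard Yelp Challenge file layout, not copied from preprocessing.R:

```r
library(jsonlite)

# Each file is newline-delimited JSON; stream_in() parses it record-by-record
# into a data.frame without holding all of the raw text in memory at once.
business <- stream_in(file("data/yelp_academic_dataset_business.json"))
user     <- stream_in(file("data/yelp_academic_dataset_user.json"))
review   <- stream_in(file("data/yelp_academic_dataset_review.json"))
```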
Once obtained, we prepared our data for our recommender system using the following transformations:
We chose to limit the scope of our recommender system to businesses with tags related to food and beverages. There were originally 508 unique category tags in our business data; we manually selected 112 targeted categories to subset the data.
We applied additional transformations to remove unnecessary data. There were 1,224 businesses in our data that were permanently closed; these accounted for 9.8% of all businesses and were removed. We also removed 3 businesses located outside of Arizona.
As a result of these transformations, our recommender data was reduced to 4,828 unique businesses. This was further limited to 4,332 after randomly sampling our user data. The output can be previewed below:
| business_id | categories | city | name | longitude | state | latitude |
|---|---|---|---|---|---|---|
| usAsSV36QmUej8--yvN-dg | Food, Grocery | Phoenix | Food City | -112.0854 | AZ | 33.39221 |
| PzOqRohWw7F7YEPBz6AubA | Food, Bagels, Delis, Restaurants | Glendale Az | Hot Bagels & Deli | -112.2003 | AZ | 33.71280 |
| qarobAbxGSHI7ygf1f7a_Q | Sandwiches, Restaurants | Gilbert | Jersey Mike’s Subs | -111.8120 | AZ | 33.37884 |
| JxVGJ9Nly2FFIs_WpJvkug | Pizza, Restaurants | Scottsdale | Sauce | -111.9263 | AZ | 33.61746 |
| Jj7bcQ6NDfKoz4TXwvYfMg | Burgers, Restaurants | Phoenix | Fuddruckers | -112.1162 | AZ | 33.56699 |
| JHp5mJvYe6UtM_QsklR-iw | Pizza, Restaurants | Scottsdale | Peter Piper Pizza | -111.9175 | AZ | 33.46613 |
We subset our review data to the reviews of these food and beverage businesses, which dropped our review data from 229,907 to 165,823 reviews. We later applied another filter to use only reviews from 10,000 randomly sampled users, which further decreased the reviews to 44,494 observations. Our review data can be previewed in two parts below:
Last, we applied a similar filter to users, subsetting the data to only our selected businesses. This decreased our user data from 43,873 to 35,268 distinct user_id observations. Due to processing constraints in R, we chose to randomly sample 10,000 users from these unique profiles.
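A minimal sketch of the filtering and sampling steps described above; food_categories stands in for the 112 manually selected tags, and the column names (open, state, categories) are assumptions based on the Yelp business schema:

```r
library(dplyr)

# keep open, in-state businesses with at least one food/beverage tag;
# categories is a list-column, so test each tag vector for a match
food_business <- business %>%
  filter(open == TRUE, state == "AZ",
         sapply(categories, function(tags) any(tags %in% food_categories)))

# keep only reviews of the retained businesses
food_review <- review %>%
  filter(business_id %in% food_business$business_id)

# randomly sample 10,000 of the users who reviewed those businesses
set.seed(1000)
sampled_users <- user %>%
  filter(user_id %in% food_review$user_id) %>%
  sample_n(10000)
```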
The dataframe preview below shows aggregate user data for all reviews an individual user provided on Yelp within our data selection.
| user_id | user_name | review_count | votes.funny | votes.useful | votes.cool | average_stars |
|---|---|---|---|---|---|---|
| --lMCM6K8-9NTvPlbCMXEA | Anne Marie | 1 | 0 | 0 | 0 | 4.0 |
| --LzFD0UDbYE-Oho3AhsOg | Shumai | 1 | 0 | 1 | 0 | 4.0 |
| --M-cIkGnH1KhnLaCOmoPQ | Emma | 1 | 2 | 2 | 2 | 5.0 |
| -01H9S7YxFrhRgNdvxmaVQ | Marc | 1 | 0 | 0 | 0 | 5.0 |
| -06LYbA4Qm_9E83KNT1Jrg | Brett | 2 | 0 | 0 | 0 | 4.5 |
| -0Ycl6yN0BsX1U70-SZOYw | Kate | 2 | 0 | 0 | 0 | 4.0 |
Next, we created our main dataframe by merging the business and review data on business_id. This dataframe serves as the source of data for our recommender algorithms. The unique character keys for users and businesses were simplified to numeric user/item identifiers (userID, itemID).
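The merge and key simplification can be sketched as below; the object names food_business and food_review are illustrative, while the column names follow the previews shown in this report:

```r
library(dplyr)

# join each review to its business on the shared business_id key
df <- inner_join(food_business, food_review, by = "business_id")

# replace the long character keys with compact numeric identifiers
df <- df %>%
  mutate(userID = as.numeric(factor(user_id)),
         itemID = as.numeric(factor(business_id)))
```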
This dataframe will be referenced later on when building our recommender matrices and algorithms. Review details were omitted in the preview for brevity.
| business_id | categories | city | name | longitude | state | latitude | votes.funny | votes.useful | votes.cool | user_id | review_id | stars | date | userID | itemID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| usAsSV36QmUej8--yvN-dg | Food, Grocery | Phoenix | Food City | -112.0854 | AZ | 33.39221 | 0 | 0 | 0 | 1Eevry0X_8yb6yzsQilptg | F-R4pX3Ane7y3VlswhWrrQ | 3 | 2011-11-20 | 1 | 1 |
| PzOqRohWw7F7YEPBz6AubA | Food, Bagels, Delis, Restaurants | Glendale Az | Hot Bagels & Deli | -112.2003 | AZ | 33.71280 | 0 | 1 | 0 | Iycf9KNRhxvR187Qu2zZHg | hg7rapz_KzAqhoOFYhXVoQ | 4 | 2012-06-11 | 2 | 2 |
| qarobAbxGSHI7ygf1f7a_Q | Sandwiches, Restaurants | Gilbert | Jersey Mike’s Subs | -111.8120 | AZ | 33.37884 | 1 | 0 | 0 | 4UypETvlv8cl0jKFxHh3Zw | OhWvwGTbiuT4tnLpK-iC-w | 2 | 2012-08-27 | 3 | 3 |
| qarobAbxGSHI7ygf1f7a_Q | Sandwiches, Restaurants | Gilbert | Jersey Mike’s Subs | -111.8120 | AZ | 33.37884 | 0 | 1 | 0 | 5j7qmDZTAetaH0yXFnAFyw | rTghOy2OZxdmI6ofRzI0Bg | 3 | 2012-03-09 | 4 | 3 |
| qarobAbxGSHI7ygf1f7a_Q | Sandwiches, Restaurants | Gilbert | Jersey Mike’s Subs | -111.8120 | AZ | 33.37884 | 1 | 2 | 1 | uNbB1uR4EBhmygUc3IfPAw | EY-eYBoXIjn2k2X_ZDTpJA | 4 | 2012-05-10 | 5 | 3 |
| JxVGJ9Nly2FFIs_WpJvkug | Pizza, Restaurants | Scottsdale | Sauce | -111.9263 | AZ | 33.61746 | 0 | 0 | 0 | l_6XDatGLHfkGxl8BjI2Ag | imbU3ZZlDf5SIKHkaEskaw | 5 | 2011-09-22 | 6 | 4 |
Add data visualizations.
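One candidate visualization is the distribution of star ratings in the merged data. This is an illustrative sketch, assuming the main dataframe df built above:

```r
library(ggplot2)

# distribution of star ratings across all sampled reviews
ggplot(df, aes(x = factor(stars))) +
  geom_bar(fill = "steelblue") +
  labs(x = "Stars", y = "Number of reviews",
       title = "Distribution of Yelp review ratings")
```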
We tested recommender algorithms using recommenderlab and sparklyr to see which performed best on our data. To test the algorithms, we first had to create a user-item matrix and then split our data into training and test sets.
Matrix Building
We converted our raw ratings data into a user-item matrix to test and train our subsequent recommender system algorithms. The matrix was saved as a realRatingMatrix for processing purposes later on using the recommenderlab package.
The matrix data can be viewed below.
# spread data from long to wide format
matrix_data <- df %>% select(userID, itemID, stars) %>% spread(itemID, stars)
# set row names to userid
rownames(matrix_data) <- matrix_data$userID
# remove userid from columns
matrix_data <- matrix_data %>% select(-userID)
# convert to matrix
ui_mat <- matrix_data %>% as.matrix()
# store matrix as realRatingMatrix
ui_mat <- as(ui_mat, "realRatingMatrix")
# view matrix data
matrix_data
Train and Test Splits
Our data was split into training and test sets to evaluate our two recommenderlab algorithms. We used the recommenderlab evaluationScheme with the split method repeated over k = 10 runs, retaining 80% of the data for training and 20% for testing.
# evaluation method with 80% of data for train and 20% for test
set.seed(1000)
evalu <- evaluationScheme(ui_mat, method = "split", train = 0.8, given = 1,
goodRating = 1, k = 10)
# Prep data
train <- getData(evalu, "train") # Training Dataset
dev_test <- getData(evalu, "known") # Test data from evaluationScheme of type KNOWN
test <- getData(evalu, "unknown") # Unknown dataset used for RMSE / model evaluation
We first evaluated user-based collaborative filtering (UBCF) with Z-score normalization and cosine similarity:
UB <- Recommender(getData(evalu, "train"), "UBCF", param = list(normalize = "Z-score",
method = "Cosine"))
p <- predict(UB, getData(evalu, "known"), type = "ratings")
p@data@x[p@data@x[] < 1] <- 1
p@data@x[p@data@x[] > 5] <- 5
calcPredictionAccuracy(p, getData(evalu, "unknown"))
##     RMSE      MSE      MAE
## 1.384047 1.915586 1.018311
We then evaluated item-based collaborative filtering (IBCF) with the same settings:
IB <- Recommender(getData(evalu, "train"), "IBCF", param = list(normalize = "Z-score",
method = "Cosine"))
p1 <- predict(IB, getData(evalu, "known"), type = "ratings")
p1@data@x[p1@data@x[] < 1] <- 1
p1@data@x[p1@data@x[] > 5] <- 5
calcPredictionAccuracy(p1, getData(evalu, "unknown"))
##      RMSE       MSE       MAE
## 1.3710486 1.8797743 0.9719316
Due to the size of our data, we chose to use Spark in R to avoid input/output (I/O) bottlenecks and maximize the speed of our recommender calculations.
We initiated a local connection to Spark (v2.4.3). Our Yelp data was loaded into a Spark table and split for training and testing purposes.
# configure spark connection
config <- spark_config()
config$spark.executor.memory <- "8G"
config$spark.executor.cores <- 2
config$spark.executor.instances <- 3
config$spark.dynamicAllocation.enabled <- "false"
# initiate connection
sc <- spark_connect(master = "local", config = config, version = "2.4.3")
# uncomment to verify version: spark_version(sc)
# select data for spark and create spark table
spark_data <- df %>% select(stars, user_id, business_id, name, city, categories)
yelp <- sdf_copy_to(sc, spark_data, "yelp", overwrite = TRUE)
# Transform features
yelp <- yelp %>% ft_string_indexer(input_col = "user_id", output_col = "user_index") %>%
ft_string_indexer(input_col = "business_id", output_col = "item_index") %>%
select(-user_id, -business_id) %>% sdf_register("yelp")
# randomly split / train test data
split <- sdf_random_split(yelp, training = 0.8, testing = 0.2, seed = 1)
# store training / test sets
train <- sdf_register(split$training, "train")
test <- sdf_register(split$testing, "test")
# tidy train for algorithms that require only user/item inputs
ui_train <- tbl(sc, "train") %>% select(user_index, item_index, stars)
ui_test <- tbl(sc, "test") %>% select(user_index, item_index, stars)
Once connected, we applied the alternating least squares (ALS) algorithm for our recommender predictions.
# build model using user/business/ratings
als_fit <- ml_als(ui_train, max_iter = 5, nonnegative = TRUE, rating_col = "stars",
user_col = "user_index", item_col = "item_index")
# predict from the model for the training data
als_predict_train <- ml_predict(als_fit, ui_train) %>% collect()
als_predict_test <- ml_predict(als_fit, ui_test) %>% collect()
# Remove NaN (result of test/train splits - not data)
als_predict_train <- als_predict_train[!is.na(als_predict_train$prediction),
]
als_predict_test <- als_predict_test[!is.na(als_predict_test$prediction), ]
# View results
als_predict_test %>% head %>% kable() %>% kable_styling()
| user_index | item_index | stars | prediction |
|---|---|---|---|
| 46 | 12 | 4 | 4.209892 |
| 1534 | 12 | 4 | 3.487417 |
| 3500 | 12 | 5 | 3.960966 |
| 3687 | 12 | 4 | 2.270529 |
| 5 | 12 | 5 | 4.115550 |
| 163 | 12 | 5 | 4.160906 |
Our ALS calculations for RMSE, MSE, and MAE can be viewed below:
# Calculate RMSE/MSE/MAE
als_mse_train <- mean((als_predict_train$stars - als_predict_train$prediction)^2)
als_rmse_train <- sqrt(als_mse_train)
als_mae_train <- mean(abs(als_predict_train$stars - als_predict_train$prediction))
als_mse_test <- mean((als_predict_test$stars - als_predict_test$prediction)^2)
als_rmse_test <- sqrt(als_mse_test)
als_mae_test <- mean(abs(als_predict_test$stars - als_predict_test$prediction))
# View metrics
cbind(als_rmse_train, als_mse_train, als_mae_train)
##      als_rmse_train als_mse_train als_mae_train
## [1,]      0.3614416       0.13064     0.2490331
cbind(als_rmse_test, als_mse_test, als_mae_test)
##      als_rmse_test als_mse_test als_mae_test
## [1,]      1.382284     1.910709     1.096935
We wanted to test other machine learning options in Spark. We tried a random forest classifier; however, this method yielded very low accuracy on our training data and would not produce a good recommender system for the businesses or users in our dataset.
# build model using user/business index, category, and city
rf_fit <- ml_random_forest(
train, # the training partition
response = "stars",
features = colnames(train)[2:4],
impurity = "entropy",
type = "classification",
seed = 1)
# identify important features in model
rf_importance1 <- ml_tree_feature_importance(sc = sc, model = rf_fit)
rf_importance2<- rf_importance1 %>% mutate(importance = round(importance,2)) %>% filter(importance>0)
# percent of terms found important
nrow(rf_importance2)/nrow(rf_importance1)
## [1] 0.02225392
The features with nonzero rounded importance are listed below:
| feature | importance |
|---|---|
| name_Arriba Mexican Grill | 0.06 |
| name_Cornish Pasty Company | 0.05 |
| name_RA Sushi Bar Restaurant | 0.04 |
| name_Walmart | 0.03 |
| name_Sauce | 0.02 |
| categories_Department Stores, Grocery, Fashion, Shopping, Food, Mobile Phones | 0.02 |
| name_Green | 0.02 |
| categories_Bars, Restaurants, American (Traditional), Sports Bars, Nightlife | 0.02 |
| categories_Food, Beer, Wine & Spirits | 0.02 |
| name_Z’Tejas Southwestern Grill | 0.02 |
| name_Applebee’s Neighborhood Grill & Bar | 0.02 |
| categories_American (New), Barbeque, Restaurants | 0.02 |
| name_The Coffee Shop | 0.02 |
| categories_Bars, Food, Breweries, Pubs, Nightlife, American (New), Restaurants | 0.02 |
| categories_Nightlife, Bars, Sushi Bars, Japanese, Restaurants | 0.02 |
| categories_Italian, Pizza, Sandwiches, Restaurants | 0.02 |
| name_32 SHEA | 0.01 |
| name_Pomegranate Cafe | 0.01 |
| categories_Food, Convenience Stores | 0.01 |
| name_McDonald’s | 0.01 |
| name_Postino Arcadia | 0.01 |
| name_RnR | 0.01 |
| name_Four Peaks Brewing Co | 0.01 |
| name_Cibo | 0.01 |
| name_Café Monarch | 0.01 |
| name_Great Steak and Potato Co. | 0.01 |
| categories_Food, Coffee & Tea | 0.01 |
| name_Ground Control At Verrado | 0.01 |
| name_Short Leash Dogs | 0.01 |
| categories_Latin American, Restaurants | 0.01 |
| city_Buckeye | 0.01 |
| name_Genghis Grill | 0.01 |
| city_Mesa | 0.01 |
| categories_Asian Fusion, Buffets, Chinese, Restaurants | 0.01 |
| categories_Fast Food, Sandwiches, Restaurants | 0.01 |
| categories_Pubs, Bars, Nightlife, British, Restaurants | 0.01 |
| categories_Nightlife, Bars, Pizza, Sports Bars, Restaurants | 0.01 |
| categories_Arts & Entertainment, American (Traditional), Arcades, Restaurants | 0.01 |
| name_Lo-Lo’s Chicken & Waffles | 0.01 |
| categories_Italian, Pizza, Restaurants | 0.01 |
| city_Phoenix | 0.01 |
| name_Sweet Republic | 0.01 |
| name_Mr Hunan | 0.01 |
| name_Domino’s Pizza | 0.01 |
| name_Lobbys Beef Burgers Dogs | 0.01 |
| categories_Steakhouses, Restaurants | 0.01 |
| categories_Gastropubs, Cafes, Restaurants | 0.01 |
| name_Oriental Express | 0.01 |
| categories_Breakfast & Brunch, Restaurants | 0.01 |
| categories_Bars, American (Traditional), Nightlife, Sports Bars, Barbeque, Restaurants | 0.01 |
| name_Rocket Burger & Subs | 0.01 |
| name_Capriottis Sandwich Shop | 0.01 |
| categories_Food, Herbs & Spices, Specialty Food | 0.01 |
| name_Hob Nobs Cafe & Spirits | 0.01 |
| categories_Burgers, American (Traditional), Mexican, Restaurants | 0.01 |
| name_Subway | 0.01 |
| name_Panda Express | 0.01 |
| name_Los Olivos | 0.01 |
| name_Postino Central | 0.01 |
| name_Crazy Buffet | 0.01 |
| name_Binkley’s | 0.01 |
| name_Citizen Public House | 0.01 |
| name_Saketini Japanese Sushi Bar and Lounge | 0.01 |
| name_Macayo’s Mexican Restaurant | 0.01 |
| name_Alexis Family Restaurant | 0.01 |
| categories_Pizza, Restaurants | 0.01 |
| categories_Tex-Mex, Restaurants | 0.01 |
| name_Zipps Sports Grill | 0.01 |
| name_Chicago Hamburger Co | 0.01 |
| name_Fry’s Food Stores | 0.01 |
| name_Cyclo Vietnamese Cuisine | 0.01 |
| name_Jimmy Buffett’s Margaritaville | 0.01 |
| categories_Polish, Scandinavian, Restaurants | 0.01 |
| name_Tommy Bahama Restaurant & Bar - Scottsdale | 0.01 |
| name_Cadillac Ranch | 0.01 |
| name_Majerle’s | 0.01 |
| categories_Burgers, Steakhouses, Barbeque, Restaurants | 0.01 |
| name_Losbetos Mexican Food | 0.01 |
Accuracy metrics for the random forest model:
# make predictions
rf_predict_train <- ml_predict(rf_fit, train)
rf_predict_test <- ml_predict(rf_fit, test)
# calculate accuracy
rf_eval_train <- rf_predict_train %>% ml_multiclass_classification_evaluator(label = "stars",
metric = "accuracy")
## UNABLE TO GET TEST ACCURACY:
## rf_eval_test <- rf_predict_test %>%
##   ml_multiclass_classification_evaluator(label = "stars", metric = "accuracy")
rf_eval_train
## [1] 0.3733378
With the user and business ratings adjusted so that 0 indicates no feedback, -1 indicates negative feedback, and 1 indicates positive feedback, we used the Jaccard distance to measure the similarity between business profiles.
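As an illustration (not the code used in this report), the Jaccard distance between two binary business profiles can be computed by treating any non-zero feedback entry as membership:

```r
# Jaccard distance between two -1/0/1 feedback vectors; any non-zero
# entry counts as "rated". Illustrative helper, not from preprocessing.R.
jaccard_distance <- function(a, b) {
  rated_a <- a != 0
  rated_b <- b != 0
  intersection <- sum(rated_a & rated_b)
  union <- sum(rated_a | rated_b)
  1 - intersection / union
}

jaccard_distance(c(1, -1, 0, 1), c(1, 0, 0, -1))  # 1 - 2/3
```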
Given the size of our data, Spark performed the fastest. However, all three algorithms yielded very similar accuracy results.
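Collecting the test-set metrics reported earlier in this section into one table makes the comparison explicit:

```r
# test-set accuracy metrics as reported above for each algorithm
results <- data.frame(
  model = c("UBCF", "IBCF", "ALS"),
  RMSE  = c(1.384047, 1.3710486, 1.382284),
  MSE   = c(1.915586, 1.8797743, 1.910709),
  MAE   = c(1.018311, 0.9719316, 1.096935)
)
results
```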
Add more to the final conclusion. Explain limitations of the system. Make recommendations for future improvements.